We present SODA: the first publicly available, million-scale, high-quality social dialogue dataset. Using SODA, we train COSMO: a generalizable conversation agent outperforming previous best-performing agents on both in- and out-of-domain datasets. In contrast to most existing crowdsourced, small-scale dialogue corpora, we distill 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al., 2022). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x; West et al., 2022). Human evaluation shows that dialogues in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets (e.g., DailyDialog (Li et al., 2017), BlendedSkillTalk (Smith et al., 2020)). In addition, extensive evaluations show that COSMO is significantly more natural and consistent on unseen datasets than best-performing dialogue models (e.g., GODEL (Peng et al., 2022), BlenderBot (Roller et al., 2021), DialoGPT (Zhang et al., 2020)). Furthermore, it is sometimes even preferred over the original human-written gold responses. We make our data, models, and code public.
Variational autoencoders employ an amortized inference model to approximate the posterior of latent variables. However, such amortized variational inference faces two challenges: (1) the limited posterior expressiveness of the fully-factorized Gaussian assumption and (2) the amortization error of the inference model. We present a novel approach that addresses both challenges. First, we focus on ReLU networks with Gaussian output and illustrate their connection to probabilistic PCA. Building on this observation, we derive an iterative algorithm that finds the mode of the posterior and apply a full-covariance Gaussian posterior approximation centered on the mode. Subsequently, we present a general framework named Variational Laplace Autoencoders (VLAEs) for training deep generative models. Based on the Laplace approximation of the latent variable posterior, VLAEs enhance the expressiveness of the posterior while reducing the amortization error. Empirical results on MNIST, Omniglot, Fashion-MNIST, SVHN and CIFAR10 show that the proposed approach significantly outperforms other recent amortized or iterative methods on ReLU networks.
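A minimal sketch of the mode-finding idea: for a linear-Gaussian decoder the posterior is available in closed form, and since a ReLU decoder is locally linear, iterating that closed-form update at the current linearization point converges to the posterior mode, where a full-covariance Gaussian (Laplace) approximation is placed. The `decode`/`jacobian` interface and the dimensions below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def laplace_posterior(x, decode, jacobian, sigma2=0.1, n_iter=10, d_z=2):
    """Sketch: approximate the posterior of z for a decoder x = f(z) + N(0, sigma2 I)
    with prior z ~ N(0, I). At each step, linearize f at the current z (exact for a
    ReLU net within one linear region) and apply the closed-form linear-Gaussian
    posterior update; return the mode and the full-covariance Laplace approximation."""
    z = np.zeros(d_z)
    for _ in range(n_iter):
        W = jacobian(z)                      # local linearization: f(z') ~ f(z) + W (z' - z)
        b = decode(z) - W @ z                # effective bias of the local linear decoder
        # Linear-Gaussian posterior covariance: (I + W^T W / sigma2)^{-1}
        Sigma = np.linalg.inv(np.eye(d_z) + W.T @ W / sigma2)
        z = Sigma @ W.T @ (x - b) / sigma2   # posterior mean (= mode) under linearization
    return z, Sigma
```

For an exactly linear decoder the iteration reproduces the closed-form probabilistic-PCA posterior in a single step, which is a convenient sanity check.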
$360^\circ$ video saliency detection is one of the challenging benchmarks for $360^\circ$ video understanding, since non-negligible distortion and discontinuity occur in the projection of any format of $360^\circ$ videos, and capture-worthy viewpoints in the omnidirectional sphere are inherently ambiguous. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using Vision Transformer with deformable convolution, which enables us not only to plug pretrained models for normal videos into our architecture without additional modules or finetuning, but also to perform geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn saliency from three simple relative relations among local patch features, outperforming state-of-the-art models on the Wild360 benchmark by large margins without supervision or auxiliary information such as class activation. We demonstrate the utility of our saliency prediction model with the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision, including head movement.
Displacement is an important measurement for evaluating structural conditions, but difficulties associated with sensor installation and measurement accuracy often hinder its field measurement. To overcome the drawbacks of conventional displacement measurement, computer vision (CV)-based methods have been implemented owing to their remote sensing capability and accuracy. This paper proposes a strategy for non-target structural displacement measurement that leverages CV to avoid the need to install a target on the structure while calibrating displacement with structured light. The proposed system, named Lavolution, calculates the relative position of the camera with respect to the structure using four equidistant beams of structured light and obtains a scale factor to convert pixel motion into structural displacement. A jig for the four structured-light beams is designed, and a corresponding alignment process is proposed. A method to compute the scale factor using the designed jig is proposed and validated through numerical simulations and lab-scale experiments. To confirm the feasibility of the proposed displacement measurement process, shaking-table and full-scale bridge experiments are conducted, and the accuracy of the proposed method is compared with that of a reference laser Doppler vibrometer.
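A minimal sketch of the calibration idea, under the simplifying assumption that the four equidistant beams appear as collinear points in the image (the actual system also estimates the camera's relative pose via the jig alignment): the mm-per-pixel scale factor follows from the known physical beam spacing and the measured pixel spacing, and tracked pixel motion is then converted to physical displacement.

```python
import numpy as np

def scale_factor(beam_pixels, beam_spacing_mm):
    """Sketch: estimate the mm-per-pixel scale factor from the image
    coordinates of four equidistant structured-light beam points with
    known physical spacing between adjacent beams."""
    beam_pixels = np.asarray(beam_pixels, dtype=float)         # (4, 2) image coords
    gaps = np.linalg.norm(np.diff(beam_pixels, axis=0), axis=1)  # adjacent pixel gaps
    return beam_spacing_mm / gaps.mean()

def pixel_to_displacement(pixel_motion, factor_mm_per_px):
    """Convert tracked pixel motion of a structural feature into displacement (mm)."""
    return np.asarray(pixel_motion, dtype=float) * factor_mm_per_px
```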
We present neural activation coding (NAC) as a novel approach for learning deep representations from unlabeled data for downstream applications. We argue that the deep encoder should maximize its nonlinear expressivity on the data for downstream predictors to take full advantage of its representation power. To this end, NAC maximizes the mutual information between the activation patterns of the encoder and the data over a noisy communication channel. We show that learning a noise-robust activation code increases the number of distinct linear regions of ReLU encoders, and hence the maximum nonlinear expressivity. More interestingly, NAC learns both continuous and discrete representations of data, which we evaluate on two downstream tasks, respectively: (i) linear classification on CIFAR-10 and ImageNet-1K and (ii) nearest-neighbor retrieval on CIFAR-10 and FLICKR-25K. Empirical results show that NAC attains better or comparable performance on both tasks compared to recent baselines, including SimCLR and DistillHash. In addition, NAC pretraining provides significant benefits to the training of deep generative models. Our code is available at https://github.com/yookoon/nac.
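To make the "activation pattern" notion concrete, here is an illustrative sketch (not the NAC training objective): the binary code recording which ReLU units fire for a given input. Each distinct code corresponds to a distinct linear region of the encoder, so counting unique codes over a dataset lower-bounds the number of linear regions the encoder uses.

```python
import numpy as np

def activation_code(x, weights, biases):
    """Return the binary activation pattern of a ReLU network on input x.
    `weights`/`biases` are lists of per-layer parameters (assumed shapes:
    W is (out, in), b is (out,)). Inputs in the same linear region of the
    network share the same code."""
    code = []
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        pre = W @ h + b
        code.append(pre > 0)          # which units fire at this layer
        h = np.maximum(pre, 0.0)      # ReLU activation
    return np.concatenate(code).astype(int)
```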
In reinforcement learning, continuous time is often discretized by a time scale $\delta$, to which the resulting performance is known to be highly sensitive. In this work, we seek to find a $\delta$-invariant algorithm for policy gradient (PG) methods, which performs well regardless of the value of $\delta$. We first identify the underlying reasons that cause PG methods to fail as $\delta \to 0$, proving that the variance of the PG estimator can diverge to infinity in stochastic environments under a certain assumption of stochasticity. While durative actions or action repetition can be employed to obtain $\delta$-invariance, previous action repetition methods cannot immediately react to unexpected situations in stochastic environments. We thus propose a novel $\delta$-invariant method named Safe Action Repetition (SAR), applicable to any existing PG algorithm. SAR can handle the stochasticity of environments by adaptively reacting to changes in states during action repetition. We empirically show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches on eight MuJoCo environments with both deterministic and stochastic settings. Our code is available at https://vision.snu.ac.kr/projects/sar.
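A hedged sketch of the idea behind state-triggered action repetition, not the authors' implementation: an action chosen by the policy is repeated at the fine time scale, but repetition stops early as soon as the state drifts farther than a threshold from where the action was chosen, so the agent can react immediately to stochastic events. The `state_dist` metric and the minimal `env` interface (`reset()`, `step()` returning `(obs, reward, done)`) are assumptions for illustration.

```python
def rollout_with_sar(env, policy, state_dist, eps, max_repeat):
    """Roll out one episode, repeating each chosen action until either the
    state leaves an eps-ball around the state where the action was chosen
    (triggering an immediate re-query of the policy) or a repetition cap
    is reached. Returns the episode return."""
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = policy(obs)
        anchor = obs                       # state where the action was chosen
        for _ in range(max_repeat):
            obs, reward, done = env.step(action)
            total_reward += reward
            if done or state_dist(obs, anchor) > eps:
                break                      # left the safe zone: re-query policy
    return total_reward
```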
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequences (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of the two sequences into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model on three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks on the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.
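A minimal sketch of the pairwise composition idea, under illustrative assumptions: every (frame, word) feature pair is composed into one cell of a 3D tensor. Here the composition is a simple elementwise product of L2-normalized features; the paper's composition is learned, so this only shows the tensor's shape and intent.

```python
import numpy as np

def joint_semantic_tensor(video_feats, text_feats):
    """Compose dense pairwise representations of two sequences into a 3D tensor
    of shape (n_frames, n_words, d), one d-dimensional cell per (frame, word)
    pair, via elementwise product of L2-normalized features (illustrative)."""
    V = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    T = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return V[:, None, :] * T[None, :, :]   # broadcast over the pair grid
```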
3D-aware image synthesis focuses on preserving spatial consistency in addition to generating high-resolution images with fine details. Recently, the Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate a generative NeRF and show remarkable achievements, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. As a result, our model shows strong 3D consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF exhibits a Fr\'echet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a $\text{128}^{2}$ resolution. Additionally, we provide FIDs of generated 3D-aware images for each class of the datasets, as it is possible to synthesize class-conditional images with $\text{C}^{3}$G-NeRF.
In both terrestrial and marine ecology, physical tagging is a frequently used method to study population dynamics and behavior. However, such tagging techniques are increasingly being replaced by individual re-identification using image analysis. This paper introduces a contrastive learning-based model for identifying individuals. The model uses the first parts of the Inception v3 network, supported by a projection head, and we use contrastive learning to find similar or dissimilar image pairs from a collection of uniform photographs. We apply this technique for corkwing wrasse, Symphodus melops, an ecologically and commercially important fish species. Photos are taken during repeated catches of the same individuals from a wild population, where the intervals between individual sightings might range from a few days to several years. Our model achieves a one-shot accuracy of 0.35, a 5-shot accuracy of 0.56, and a 100-shot accuracy of 0.88, on our dataset.
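A hedged sketch of how k-shot re-identification is typically evaluated with such an embedding model (the function names and interface below are illustrative, not from the paper): gallery photos are ranked by cosine similarity to a query embedding, and a k-shot query succeeds if the true individual appears among the top-k matches.

```python
import numpy as np

def reidentify(query_emb, gallery_embs, gallery_ids, k=1):
    """Rank gallery photos by cosine similarity of their embeddings to the
    query embedding and return the identities of the top-k matches.
    Assumes embeddings come from a contrastively trained encoder."""
    q = np.asarray(query_emb, dtype=float)
    q = q / np.linalg.norm(q)
    G = np.asarray(gallery_embs, dtype=float)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    order = np.argsort(-(G @ q))           # most similar first
    return [gallery_ids[i] for i in order[:k]]
```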
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
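The oracle greedy policy can be sketched directly for a small discrete distribution, assuming oracle access to the joint pmf (the very assumption the paper's amortized model is trained to remove). At each step, the feature with the largest conditional mutual information with the label, given the values already observed, is queried next. The pmf encoding below is an illustrative choice.

```python
import math
from collections import defaultdict

def cond_mutual_info(pmf, i, selected, values):
    """I(x_i; y | x_S = values) for a discrete joint pmf given as a dict
    {(x_tuple, y): prob}, conditioning on observed values of the already
    selected feature indices."""
    cond = {(x, y): p for (x, y), p in pmf.items()
            if all(x[j] == v for j, v in zip(selected, values))}
    z = sum(cond.values())
    joint = defaultdict(float)             # p(x_i, y | observations)
    for (x, y), p in cond.items():
        joint[(x[i], y)] += p / z
    px, py = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        px[a] += p
        py[b] += p
    return sum(p * math.log(p / (px[a] * py[b]))
               for (a, b), p in joint.items() if p > 0)

def greedy_dfs(pmf, x_obs, n_features, budget):
    """Greedy dynamic feature selection with oracle access to the pmf:
    sequentially query the feature most informative about y given the
    features observed so far."""
    selected, values = [], []
    for _ in range(budget):
        rest = [i for i in range(n_features) if i not in selected]
        best = max(rest, key=lambda i: cond_mutual_info(pmf, i, selected, values))
        selected.append(best)
        values.append(x_obs[best])
    return selected
```

On a toy distribution where y copies feature 0 and feature 1 is independent noise, the greedy policy queries feature 0 first, as expected.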